Abstract

Nearest neighbor (k-NN) graphs are widely used in machine learning and
data mining applications, and our aim is to better understand what they
reveal about the cluster structure of the unknown underlying distribution
of points. Moreover, is it possible to identify spurious structures that
might arise due to sampling variability?

Our first contribution is a statistical analysis that reveals how certain
subgraphs of a k-NN graph form a consistent estimator of the cluster tree
of the underlying distribution of points. Our second and perhaps most
important contribution is the following finite sample guarantee. We
carefully work out the tradeoff between aggressive and conservative
pruning and are able to guarantee the removal of all spurious cluster
structures at all levels of the tree while at the same time guaranteeing
the recovery of salient clusters. This is the first such finite sample
result in the context of clustering.

1. Introduction 

In this work, we consider the nearest neighbor (k-NN) graph, where each
sample point is linked to its nearest neighbors. These graphs are widely
used in machine learning and data mining applications, and interestingly there
is still much to understand about their expressiveness. In 
particular we would like to better understand what such a 
graph on a finite sample of points might reveal about the 
cluster structure of the underlying distribution of points. 
More importantly we are interested in whether one can 
identify spurious structures that are artifacts of sampling 
variability, i.e. spurious structures that are not representa- 
tive of the true cluster structure of the distribution. 

Our first contribution is in exposing more of the richness of k-NN
graphs. Let $G_n$ be a k-NN graph over an n-sample from a distribution
$\mathcal{F}$ with density $f$. Previous work (Maier et al., 2009) has shown
that the connected components (CCs) of a given level set of $f$ can be
approximated by the CCs of some subgraph of $G_n$, provided the level set
satisfies certain boundary conditions. However, it remained unclear
whether or when all level sets of $f$ might satisfy these conditions, in
other words, whether the CCs of any level set can be recovered. We show
under mild assumptions on $f$ that the CCs of any level set can be
recovered by subgraphs of $G_n$ for n sufficiently large. Interestingly,
these subgraphs are obtained in a rather simple way: just remove points
from the graph in decreasing order of their k-NN radius (distance to the
k'th nearest neighbor), and we obtain a nested hierarchy of subgraphs
which approximates the cluster tree of $\mathcal{F}$, i.e. the nested
hierarchy formed by the level sets of $f$ (see Figure 1, also Section 2.1).

Figure 1. A density $f$ (black line) and its cluster tree (dashed).
The CCs of 3 level sets are shown in lighter color at the bottom.

Our second, and perhaps more important, contribution is in providing the
first concrete approach in the context of clustering that guarantees the
pruning of all spurious cluster structures at any tree level. We carefully
work out the tradeoff between pruning "aggressively" (and potentially
removing important clusters) and pruning "conservatively" (with the risk
of keeping spurious clusters) and derive tuning settings that require no
knowledge of the underlying distribution beyond an upper bound on $f$. We
can thus guarantee in a finite sample setting that (a) all clusters
remaining at any level of the pruned tree correspond to CCs of some level
set of $f$, i.e. all spurious clusters are pruned away, and (b) salient
clusters are still discovered, where the degree of saliency depends on the
sample size n. We can show furthermore that the pruned tree remains a
consistent estimator of the underlying cluster tree, i.e. the CCs of any
level set of $f$ are recovered for sufficiently large n. Interestingly,
the pruning procedure is not tied to the k-NN method, but is based on a
simple intuition that can be applied to other cluster tree methods (see
Section 3).

Our results rely on a central "connectedness" lemma (Section 5.2) that
identifies which CCs of $f$ remain connected in the empirical tree. This
is done by analyzing the way in which k-NN radii vary along a path in a
dense region.

1.1. Related work 

Recovering the cluster tree of the underlying density is a clean
formalism of hierarchical clustering proposed in 1981 by J. A. Hartigan
(Hartigan, 1981). Hartigan showed in the same seminal paper that the
single-linkage algorithm is a consistent estimator of the cluster tree for
densities on $\mathbb{R}$. For $\mathbb{R}^d$, $d > 1$, it is known that the
empirical cluster tree of a consistent density estimate is a consistent
estimator of the underlying cluster tree (see e.g. (Wong & Lane, 1983));
unfortunately, there is no known algorithm for computing this empirical
tree. Nonetheless, the idea has led to the development of interesting
heuristics based on first estimating density, then approximating the
cluster tree of the density estimate in high dimension (Wong & Lane, 1983;
Stuetzle & Nugent, 2010).

Much other related work, such as (Rigollet & Vert, 2009; Singh et al.,
2009; Maier et al., 2009; Rinaldo & Wasserman, 2010), considers the task
of recovering the CCs of a single level set, the closest to the present
work being (Maier et al., 2009), which uses a k-NN graph for level set
estimation. As previously discussed, level set estimation however never
led to a consistent estimator of the cluster tree, since these results
typically impose technical requirements on the level set being recovered
but do not work out how or when these requirements might be satisfied by
all level sets of a distribution.

A recent insightful paper of Chaudhuri & Dasgupta (2010) presents the
first provably consistent algorithm for estimating the cluster tree. At
each level of the empirical cluster tree, they retain only those samples
whose k-NN radii are below a scale parameter $r$ which indexes the level;
CCs at this level are then discovered by building an $r$-neighborhood
graph on the retained samples. This is similar to an earlier
generalization of single-linkage by Wishart (1969), which however was
given without a convergence analysis. The k-NN tree studied here differs
in that, at an equivalent level $r$, points are connected to the subset of
their k-nearest neighbors retained at that level. One practical appeal of
our method is its simplicity; we need only remove points from an initial
k-NN graph to obtain the various levels of the empirical cluster tree.

(Chaudhuri & Dasgupta, 2010) provides finite sample results for a
particular setting of $k \approx \log n$. In contrast, our finite sample
results are given for a wide range of values of $k$, namely for
$\log n \lesssim k \lesssim n^{2\alpha/(3\alpha+d)}$ (see (1)). In both
cases the finite sample results establish natural separation conditions
under which the CCs of level sets are recovered (see Theorem 1). The
result of (Chaudhuri & Dasgupta, 2010) however allows the possibility that
some empirical clusters are just artifacts of sampling variability. We
provide a simple pruning procedure that ensures that clusters discovered
empirically at any level correspond to true clusters at some level of the
underlying cluster tree. Note that this can be trivially guaranteed by
returning a single cluster at all levels, so we additionally guarantee
that the algorithm discovers salient modes of the density, where the
saliency depends on empirical quantities (see Theorem 2).

A recent archived paper (Rinaldo et al., 2010) also treats 
the problem of false clusters in cluster tree estimation, but 
the result is not algorithmic as they only consider the clus- 
ter tree of an empirical density estimate, and do not provide 
a way to compute this cluster tree. 

There exist many pruning heuristics in the literature, which typically
consist of removing small clusters (Maier et al., 2009; Stuetzle & Nugent,
2010) using some form of thresholding. The difficulty with these
approaches is in how to define small without making strong assumptions on
the unknown underlying distribution, or on the tree level being pruned
(levels correspond to different resolutions or cluster sizes). Moreover,
even the assumption that spurious clusters must be small does not
necessarily hold. Consider for example a cluster made up of two large
regions connected by a thin bridge of low mass; the two large regions can
easily appear as two separate clusters in a finite sample. Some more
sophisticated methods such as (Stuetzle & Nugent, 2009) do not rely on
cluster size for pruning; instead they return confidence values for the
empirical clusters based on various notions of cluster stability.
Unfortunately, they do not provide finite sample guarantees. Our pruning
guarantees the removal of all spurious clusters, large and small (see
Figure 2); we make no assumption on the shape of clusters beyond a
smoothness assumption on the density; and we provide a simple tuning
parameter whose setting requires just an upper bound on the density.

2. Preliminaries 

Assume the finite dataset $\mathbf{X} = \{X_i\}_{i=1}^n$ is drawn i.i.d.
from a distribution $\mathcal{F}$ over $\mathbb{R}^d$ with density function $f$.

We start with some simple definitions related to k-NN operations. All
balls, unless otherwise specified, denote closed balls in $\mathbb{R}^d$.

Definition 1 (k-NN radii). For $x \in \mathbf{X}$, let $r_{k,n}(x)$ denote
the radius of the smallest ball centered at $x$ containing $k$ points from
$\mathbf{X} \setminus \{x\}$. Also, let $r_k(x)$ denote the radius of the
smallest ball centered at $x$ of $\mathcal{F}$-mass $k/n$.

Definition 2 (k-NN and mutual k-NN graphs). The k-NN graph is that whose
vertices are the points in $\mathbf{X}$, and where $X_i$ is connected to
$X_j$ iff $X_i \in B(X_j, \theta r_{k,n}(X_j))$ or
$X_j \in B(X_i, \theta r_{k,n}(X_i))$, for some $\theta > 0$. The mutual
k-NN graph is that where $X_i$ is connected to $X_j$ iff
$X_i \in B(X_j, \theta r_{k,n}(X_j))$ and $X_j \in B(X_i, \theta r_{k,n}(X_i))$.
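
Definitions 1 and 2 can be illustrated with a minimal brute-force sketch in plain Python. It assumes the sample is a list of coordinate tuples and that adjacency is tested against the empirical radii $r_{k,n}$; the function names `knn_radii` and `knn_graph` are illustrative and do not come from the original text.

```python
import math


def knn_radii(points, k):
    """r_{k,n}(x): distance from each sample to its k-th nearest other sample."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def knn_graph(points, k, theta=1.0, mutual=False):
    """Edge set {(i, j), i < j} of the k-NN or mutual k-NN graph."""
    r = knn_radii(points, k)
    edges = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            in_ball_i = d <= theta * r[i]  # X_j lies in B(X_i, theta * r(X_i))
            in_ball_j = d <= theta * r[j]  # X_i lies in B(X_j, theta * r(X_j))
            linked = (in_ball_i and in_ball_j) if mutual else (in_ball_i or in_ball_j)
            if linked:
                edges.add((i, j))
    return edges
```

The quadratic loops are only meant for small samples; a spatial index would be the natural replacement at scale.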

2.1. Cluster tree 

Definition 3 (Connectedness). We say $A \subset \mathbb{R}^d$ is connected
if for every $x, x' \in A$ there exists a continuous 1-1 function
$P : [0, 1] \to A$ where $P(0) = x$ and $P(1) = x'$. $P$ is called a path in
$A$ between $x$ and $x'$.

The cluster tree of $f$ will be denoted $\{G(\lambda)\}_{\lambda > 0}$, where
$G(\lambda)$ are the CCs of the level set $\{x : f(x) \ge \lambda\}$. Notice
that $\{G(\lambda)\}_{\lambda > 0}$ forms an (infinite) tree hierarchy where
for any two components $A$, $A'$, either $A \cap A' = \emptyset$ or one is a
descendant of the other, i.e. $A \subset A'$ or $A' \subset A$.

3. Algorithm 

Definition 4 (k-NN density estimate). Define the density estimate at
$x \in \mathbb{R}^d$ as:

$$f_n(x) \doteq \frac{k}{n \cdot \mathrm{vol}(B(x, r_{k,n}(x)))} = \frac{k}{n \cdot v_d\, r_{k,n}^d(x)},$$

where $v_d$ is the volume of the unit ball in $\mathbb{R}^d$.
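
As a small illustration of Definition 4, the sketch below evaluates the k-NN density estimate at the sample points only. It assumes the sample is a list of coordinate tuples and uses the standard closed form $v_d = \pi^{d/2} / \Gamma(d/2 + 1)$ for the unit-ball volume; the function names are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume v_d of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def knn_density(points, k):
    """f_n(X_i) = k / (n * v_d * r_{k,n}(X_i)^d) for every sample point X_i."""
    n, d = len(points), len(points[0])
    v_d = unit_ball_volume(d)
    estimates = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r_kn = dists[k - 1]  # distance to the k-th nearest other sample
        estimates.append(k / (n * v_d * r_kn ** d))
    return estimates
```

Points with small k-NN radius receive a large estimate, which is why removing points in decreasing order of radius walks up the levels of the tree.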

Let $G_n$ be the k-NN or mutual k-NN graph. For $\lambda > 0$, define
$G_n(\lambda)$ as the subgraph of $G_n$ containing only vertices in
$\{X_i : f_n(X_i) \ge \lambda\}$ and corresponding edges. The CCs of
$\{G_n(\lambda)\}_{\lambda > 0}$ form a tree: let $A_n$ and $A'_n$ be two such
CCs, either $A_n \cap A'_n = \emptyset$ or one is a descendant of the other,
i.e. $A_n$ is a subgraph of $A'_n$ or vice versa. To simplify notation, we
let the set $\{G_n(\lambda)\}_{\lambda > 0}$ denote the empirical cluster tree
before pruning.
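
One level of the unpruned empirical tree can be sketched as follows. The code assumes an edge set over integer vertex indices and a list of density estimates indexed the same way (both hypothetical inputs), keeps the vertices at or above the level, and finds connected components with a small union-find.

```python
def level_components(edges, f_n, lam):
    """Connected components of G_n(lam): vertices with f_n >= lam and the
    edges of G_n between them."""
    keep = {i for i, value in enumerate(f_n) if value >= lam}
    parent = {i: i for i in keep}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        if i in keep and j in keep:
            parent[find(i)] = find(j)

    groups = {}
    for i in keep:
        groups.setdefault(find(i), set()).add(i)
    return sorted((frozenset(g) for g in groups.values()), key=min)
```

Calling this for a decreasing sequence of levels yields nested components, which is the tree structure described above.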

Pruning 

The pruning procedure (Algorithm 1) consists of simple lookups: it
reconnects CCs at level $\lambda$ if they are part of the same CC at level
$\lambda - \tilde\epsilon$, where the tuning parameter $\tilde\epsilon \ge 0$
controls how aggressively we prune. We show its behavior on a finite
sample in Figure 2.

The intuition behind the procedure is the following. Suppose
$A_n, A'_n \subset \mathbf{X}$ are disconnected at some level $\lambda$ in the
empirical tree before pruning. However, they ought to be connected, i.e.
their vertices belong to the same CC $A$ at the highest level where they
are all contained in the underlying cluster tree. Then, key sample points
from $A$ that would have kept them connected are missing at level
$\lambda$ in the empirical tree. These key points have $f_n$ values lower
than $\lambda$, but probably not much lower. By looking down to a lower
level near $\lambda$ we find that $A_n$, $A'_n$ are connected and thus detect
the situation. Notice that this intuition is not tied to the k-NN cluster
tree but can be applied to any other cluster tree procedure. All that is
required is that all points from $A$ (as discussed above) be connected at
some level in the tree close to $\lambda$.

Figure 2. Pruning at work: it reconnects CCs independent of size. The
dashed lines are reconnection edges from pruning. Shown are two levels of
the k-NN tree of a 500-sample from the 2-modes mixture
$0.5\,\mathcal{N}([0, 0], I_2) + 0.5\,\mathcal{N}([1, 4], I_2)$. Here $k = 12$,
$\theta = 1$, $\tilde\epsilon = F/\sqrt{k}$ where $F = 2.73$ is the maximum $f_n$
value. From left to right, level $\lambda = 0.9$ has 72 points, and level
$\lambda = 1.3$ has 33.



Algorithm 1 Prune $G_n(\lambda)$

Given: tuning parameter $\tilde\epsilon \ge 0$, same for all levels.
$\tilde G_n(\lambda) \leftarrow G_n(\lambda)$.
if $\lambda > \tilde\epsilon$ then
    Connect components $A_n, A'_n$ of $\tilde G_n(\lambda)$ if they are part of
    the same component of $G_n(\lambda - \tilde\epsilon)$.
else
    Connect all of $\tilde G_n(\lambda)$.
end if

It is not hard to see that the CCs of the pruned subgraphs
$\{\tilde G_n(\lambda)\}_{\lambda > 0}$ still form a tree. We will hence denote
the pruned empirical tree by $\{\tilde G_n(\lambda)\}_{\lambda > 0}$.
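
The pruning step can be sketched in a few lines of Python. This is an illustrative reading of Algorithm 1, not the authors' implementation: it assumes an edge set over integer vertex indices and a list of density estimates, and it uses the fact that every component at level $\lambda$ sits inside a single component at the lower level, so merging amounts to intersecting the lower-level components with the vertices kept at level $\lambda$. The names `components` and `pruned_components` are hypothetical.

```python
def components(edges, vertices):
    """Connected components of the subgraph induced on `vertices`."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, j in edges:
        if i in parent and j in parent:
            parent[find(i)] = find(j)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def pruned_components(edges, f_n, lam, eps):
    """Components of the pruned graph at level lam with tuning parameter eps."""
    keep = {i for i, value in enumerate(f_n) if value >= lam}
    if not keep:
        return []
    if lam <= eps:
        return [frozenset(keep)]  # "else" branch: connect everything
    lower = {i for i, value in enumerate(f_n) if value >= lam - eps}
    # Components at level lam that share a component at level lam - eps merge.
    merged = [frozenset(c & keep) for c in components(edges, lower)]
    return sorted((c for c in merged if c), key=min)
```

With `eps = 0` the function returns the unpruned components, matching the remark that the unpruned tree is the special case of a zero tuning parameter.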

4. Results Overview 

We make the following assumptions on the density $f$.

(A.1) $\exists F > 0$, $\sup_{x \in \mathbb{R}^d} f(x) \le F$.

(A.2) $f$ is Hölder-continuous, i.e. there exist $L, \alpha > 0$ such that
for all $x, x' \in \mathbb{R}^d$,

$$|f(x) - f(x')| \le L \|x - x'\|^{\alpha}.$$

Theorem 1 below is a finite sample result that establishes conditions
under which samples from a connected subset of $\mathbb{R}^d$ remain
connected in the empirical cluster tree, and samples from two disconnected
subsets of $\mathbb{R}^d$ remain disconnected even after pruning.
Essentially, for $k$ sufficiently large, points from connected subsets $A$
remain connected below some level. Also, provided $k$ is not too large,
disjoint subsets $A$ and $A'$ which are separated by a large enough region
of low density (relative to $n$, $k$ and $\tilde\epsilon$) remain disconnected
above some level.

We require the following two definitions. 

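Definition 5 (Envelope of $A \subset \mathbb{R}^d$). Let
$A \subset \mathbb{R}^d$ and for $r > 0$, define
$A_{+r} \doteq \{y : \exists x \in A,\ y \in B(x, r)\}$.

Definition 6 (($\epsilon$, $r$)-separated sets). $A, A' \subset \mathbb{R}^d$
are ($\epsilon$, $r$)-separated if there exists a separating set $S$ such
that every path in $\mathbb{R}^d$ between $A$ and $A'$ intersects $S$, and

$$\sup_{x \in S_{+r}} f(x) \le \inf_{x \in A \cup A'} f(x) - \epsilon.$$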

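Theorem 1. Suppose $f$ satisfies (A.1) and (A.2). Let $G_n$ be the k-NN or
mutual k-NN graph. Let $\delta > 0$ and define
$\epsilon_k \doteq 11 F \sqrt{\ln(2n/\delta)/k}$. There exist $C$ and
$C' = C'(\mathcal{F})$ such that, for

$$C \left(\max\{1, \sqrt{2}/\theta\}\right)^d d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)}, \quad (1)$$

the following holds with probability at least $1 - 3\delta$ simultaneously
for subsets $A$ of $\mathbb{R}^d$.

(a) Let $A$ be a connected subset of $\mathbb{R}^d$, and let
$\lambda \doteq \inf_{x \in A} f(x) > 2\epsilon_k$. All points in
$A \cap \mathbf{X}$ belong to the same CC of $\tilde G_n(\lambda - 2\epsilon_k)$.

(b) Let $A$ and $A'$ be two disjoint subsets of $\mathbb{R}^d$, and define
$\lambda = \inf_{x \in A \cup A'} f(x)$. Recall that $\tilde\epsilon \ge 0$ is the
tuning parameter. Suppose $A$ and $A'$ are ($\epsilon$, $r$)-separated for
$\epsilon = 6\epsilon_k + 2\tilde\epsilon$ and
$r = \frac{\theta}{2} (4k / v_d n \lambda)^{1/d}$. Then $A \cap \mathbf{X}$ and
$A' \cap \mathbf{X}$ are disconnected in $\tilde G_n(\lambda - 2\epsilon_k)$.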

Theorem 1 above, although written in terms of $\tilde G_n$, applies also to
$G_n$ by just setting $\tilde\epsilon = 0$. The theorem implies consistency of
both pruned and unpruned k-NN trees under mild additional conditions. Some
such conditions are illustrated in the corollary below. A nice practical
aspect of the pruning procedure is that consistency is obtained for a wide
range of settings of $\tilde\epsilon$ and $k$ as functions of $n$.

Corollary 1 (Consistency). Suppose that $f$ satisfies (A.1) and (A.2) and
that, in addition, $\mathcal{F}$ is supported on a compact set, and for any
$\lambda > 0$, there are finitely many components in $G(\lambda)$. Assume
that, as $n \to \infty$, $\tilde\epsilon = \tilde\epsilon(n) \to 0$ and
$k/\log n \to \infty$, while $k = k(n)$ satisfies (1).

For any $A \subset \mathbb{R}^d$, let $A_n$ denote the smallest component of
$\{\tilde G_n(\lambda)\}_{\lambda > 0}$ containing $A \cap \mathbf{X}$. Fix
$\lambda > 0$. We have
$\lim_{n \to \infty} \mathbb{P}\left(\forall A, A' \in G(\lambda),\ A_n \text{ is disjoint from } A'_n\right) = 1$.



Proof. Let $A$ and $A'$ be separate components of $G(\lambda)$. The
assumptions ensure that all paths between $A$ and $A'$ traverse a compact
set $S$ satisfying $\lambda - \max_{x \in S} f(x) \doteq \epsilon_S > 0$ (see
Lemma 14 of (Chaudhuri & Dasgupta, 2010)). Let
$\epsilon = 6\epsilon_k + 2\tilde\epsilon$ and
$r = \frac{\theta}{2} (4k / v_d n \lambda)^{1/d}$. By uniform continuity of
$f$, there exists $N_1$ such that for $n > N_1$, $r$ is small enough so that
$\lambda - \max_{x \in S_{+r}} f(x) \ge \epsilon_S/2$. Also, there exists
$N_2 > N_1$ such that for $n > N_2$, $\epsilon < \epsilon_S/2$, in other words
$\sup_{x \in S_{+r}} f(x) \le \lambda - \epsilon$.

Since $G(\lambda)$ is finite, there exists $N$ such that for $n > N$, all
pairs $A, A'$ have a suitable ($\epsilon$, $r$)-separating set $S$. Thus by
Theorem 1, for $n > N$, with probability at least $1 - 3\delta$,
$\forall A, A' \in G(\lambda)$, $A \cap \mathbf{X}$ and $A' \cap \mathbf{X}$ are
fully contained in $\tilde G_n(\lambda - 2\epsilon_k)$ and are disjoint. They are
thus disjoint at any higher level, so $A_n$ and $A'_n$ are also disjoint.

The above holds for all $\delta > 0$, so the statement follows. □

While Theorem 1 establishes that a connected set $A$ remains connected
below some level, it does not guarantee against parts of $A$ becoming
disconnected at higher levels, creating spurious clusters. Note that the
removal of spurious clusters can be trivially guaranteed by just making
the parameter $\tilde\epsilon$ very large, but the ability of the algorithm to
discover true clusters is necessarily affected. We are interested in how
to set $\tilde\epsilon$ in order to guarantee the removal of spurious clusters
while still recovering important ones.

Theorem 2 guarantees that, by setting $\tilde\epsilon$ as $\Omega(\epsilon_k)$
(recall $\epsilon_k$ from Theorem 1), separate CCs of the empirical cluster
tree correspond to actual clusters of the (unknown) underlying
distribution, i.e. all spurious clusters are removed. The setting of
$\tilde\epsilon$ only requires an upper bound $F$ on the density $f$ (see
footnote 1). Note that, under such a setting, consistency is maintained
per Corollary 1, and in light of Theorem 1 (b), we can expect that
interesting clusters are discovered. In particular the following salient
modes of $f$ are discovered.
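
To make the tuning rule concrete, here is a hedged sketch that evaluates $\epsilon_k = 11 F \sqrt{\ln(2n/\delta)/k}$ and the threshold $3\epsilon_k$ required by Theorem 2 (a), using the largest density estimate as a surrogate for $F$ as suggested in footnote 1. The function names and the default `delta` are assumptions made for illustration. The constants come from the worst-case analysis, so the resulting value is conservative; the experiments reported in the figures use smaller settings of the form $F/\sqrt{k}$.

```python
import math


def epsilon_k(F, n, k, delta):
    """epsilon_k = 11 * F * sqrt(ln(2n / delta) / k)."""
    return 11.0 * F * math.sqrt(math.log(2 * n / delta) / k)


def pruning_threshold(f_n_values, k, delta=0.05):
    """Smallest tuning parameter allowed by Theorem 2 (a), namely 3 * epsilon_k,
    with max_i f_n(X_i) standing in for the unknown bound F."""
    F_hat = max(f_n_values)
    return 3.0 * epsilon_k(F_hat, len(f_n_values), k, delta)
```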

Definition 7 (($\epsilon$, $r$)-salient mode). An ($\epsilon$, $r$)-salient
mode is a leaf node $A$ of the cluster tree $\{G(\lambda)\}_{\lambda > 0}$
which has an ancestor $A_k \supset A$ (possibly $A$ itself) satisfying:

(i) $A_k$ is the ancestor of a single leaf of $\{G(\lambda)\}_{\lambda > 0}$,
namely $A$.

(ii) $A_k$ is large: $\exists x \in A_k$, $B(x, r_k(x)) \subset A_k$.

(iii) $A_k$ is sufficiently separated from other components at its level:
let $\lambda \doteq \inf_{x \in A_k} f(x)$; $A_k$ and
$(\{x : f(x) \ge \lambda\} \setminus A_k)$ are ($\epsilon$, $r$)-separated.

Notice that, under the assumptions of Corollary 1, every mode of $f$ is
($\epsilon$, $r$)-salient for sufficiently large $k$ and $1/\epsilon$.

1. We might just use $\max_{i \in [n]} f_n(X_i)$ in practice, which in light
of Lemma 1 can be a good surrogate for $F$ (see Figure 3).

Figure 3. (LEFT) Number of modes (leaves of the empirical tree) as we
increase $\tilde\epsilon$ from 0. The trees are built on 500-samples (results
are averaged over ten such 500-samples) from the 5-modes mixture
$\sum_{i=1}^{5} 0.2\,\mathcal{N}(2\sqrt{d}\,e_i, I_d)$, $d = 7$. Here
$k = (\log n)^{[\ldots]}$, $\theta = 1$, and $F$ is the maximum $f_n$ value over
the 10 samples. The mutual k-NN tree, being more sparse, is rather brittle
and requires more pruning. (RIGHT) We fix $\tilde\epsilon = F/(4\sqrt{k})$,
$k = (\log n)^{[\ldots]}$, as we increase $n$. Results are averaged over 10
n-samples for each $n$, and $F$ is again the max $f_n$ value over the 10
samples for each $n$. The k-NN tree quickly asymptotes at 5 modes. The
mutual k-NN tree being more brittle, we are underpruning for $n > 500$,
i.e. $\tilde\epsilon$ is too small; thus for these settings we would require
larger $n$ to obtain the correct number of modes.

Theorem 2 (Pruning guarantees). Let $\delta > 0$. Under the assumptions of
Theorem 1, the following holds with probability at least $1 - 3\delta$.

(a) Suppose the tuning parameter $\tilde\epsilon \ge 3\epsilon_k$. Consider two
disjoint CCs $A_n$ and $A'_n$ at the same level in
$\{\tilde G_n(\lambda)\}_{\lambda > 0}$. Let $V$ be the union of vertices of
$A_n$ and $A'_n$, and define $\lambda = \inf_{x \in V} f(x)$. The vertices of
$A_n$ and those of $A'_n$ are in separate CCs of $G(\lambda)$.

(b) Let $\epsilon = 6\epsilon_k + 2\tilde\epsilon$ and
$r = \frac{\theta}{2} (4k / v_d n \lambda)^{1/d}$. There exists a 1-1 map from
the set of ($\epsilon$, $r$)-salient modes to the leaves of the empirical
tree $\{\tilde G_n(\lambda)\}_{\lambda > 0}$.

The behavior of both the k-NN and mutual k-NN tree, as guaranteed in
Theorem 2, is illustrated in Figure 3.

5. Analysis 

Theorem 1 follows from Lemmas 3 and 6 below. These two lemmas depend on
the events described by Lemmas 1, 2 and 4, which happen with a combined
probability of at least $1 - 3\delta$ for a confidence parameter
$\delta > 0$.

Theorem 2 follows from Lemmas 5 and 7 below. These two lemmas also depend
on the events described by Lemmas 1, 2 and 4, which happen with a combined
probability of at least $1 - 3\delta$.

5.1. Maintaining Separation 

In this section we establish conditions under which points from two
disconnected subsets of $\mathbb{R}^d$ remain disconnected in the empirical
tree, even after pruning.

The following is an important lemma which establishes the estimation
error of $f_n$ relative to $f$ on the sample $\mathbf{X}$. Interestingly,
although of independent interest, we could not find this sort of finite
sample statement in the literature on k-NN (see footnote 2), at least not
under our assumptions. The proof, presented as a supplement in the
appendix, is a bit involved and starts with some intuition from an
asymptotic analysis of (Devroye & Wagner, 1977) combined with a form of
the Chernoff bound found in (Angluin & Valiant, 1979).

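Lemma 1. Suppose $f$ satisfies (A.1) and (A.2). There exists
$C = C(\mathcal{F})$ such that for $\delta > 0$, for
$\epsilon \doteq 11 F \sqrt{\ln(2n/\delta)/k}$ and

$$121 \ln(2n/\delta) \le k \le C \left(F \sqrt{\ln(2n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

we have with probability at least $1 - \delta$ that

$$\sup_{X_i \in \mathbf{X}} |f_n(X_i) - f(X_i)| \le \epsilon.$$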



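The next lemma bounds $r_{k,n}(X_i)$ in terms of $r_k(X_i)$, and hence, in
terms of the density at $X_i$. The proof is provided as a supplement in the
appendix.

Lemma 2. Suppose $f$ satisfies (A.1) and (A.2). Fix $\lambda > 0$ and let
$\mathcal{L}_\lambda \doteq \{x : f(x) \ge \lambda\}$.

(a) Let $r \doteq \frac{1}{2} (\lambda/2L)^{1/\alpha}$. We have
$\forall x, x' \in \mathbb{R}^d$,
$\|x - x'\| \le 2r \implies |f(x) - f(x')| \le \lambda/2$. If in addition
$x \in \mathcal{L}_\lambda$, it follows that $f(x)/2 \le f(x') \le 2 f(x)$.

(b) Suppose $k \le 2^{-(d+3)} v_d (2L)^{-d/\alpha} \lambda^{(d+\alpha)/\alpha} n$.
We have

$$\forall x \in \mathcal{L}_\lambda, \quad r_k(x) \le \min\left\{ 2^{-3/d} r,\ \left(\frac{2k}{v_d n f(x)}\right)^{1/d} \right\}.$$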



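For $\delta > 0$, if in addition $k \ge 192 \ln(2n/\delta)$, we have with
probability at least $1 - \delta$ that for all
$X_i \in \mathbf{X} \cap \mathcal{L}_\lambda$,
$2^{-3/d} r_k(X_i) \le r_{k,n}(X_i) \le 2^{3/d} r_k(X_i)$.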
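
Part (a) of Lemma 2 follows directly from the Hölder assumption (A.2). The short derivation below is a reader's sketch under the reading $r = \frac{1}{2}(\lambda/2L)^{1/\alpha}$, so that $(2r)^{\alpha} = \lambda/(2L)$; it is not reproduced from the original proof, which is deferred to the appendix.

```latex
\begin{align*}
\|x - x'\| \le 2r
  &\implies |f(x) - f(x')| \le L\,\|x - x'\|^{\alpha} \le L\,(2r)^{\alpha}
   = L \cdot \frac{\lambda}{2L} = \frac{\lambda}{2}, \\
x \in \mathcal{L}_\lambda
  &\implies \frac{\lambda}{2} \le \frac{f(x)}{2}
   \implies \frac{f(x)}{2} \le f(x) - \frac{\lambda}{2} \le f(x')
   \le f(x) + \frac{\lambda}{2} \le \frac{3}{2}\, f(x) \le 2 f(x).
\end{align*}
```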

The main separation lemma is next. It says that if A and 
A' are separated by a sufficiently large low density region, 
then they remain separated in the empirical tree. 



2. There are however many asymptotic analyses of k-NN methods such as
(Devroye & Wagner, 1977).

Lemma 3 (Separation). Suppose $f$ satisfies (A.1) and (A.2). Let $G_n$ be
the k-NN or mutual k-NN graph. Define
$\epsilon_k \doteq 11 F \sqrt{\ln(2n/\delta)/k}$, and let $\delta > 0$. There
exists $C = C(\mathcal{F})$ such that, for

$$192 \ln(2n/\delta) \le k \le C \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

the following holds with probability at least $1 - 2\delta$ simultaneously
for any two disjoint subsets $A, A'$ of $\mathbb{R}^d$.

Let $\lambda = \inf_{x \in A \cup A'} f(x)$. If $A$ and $A'$ are
($\epsilon$, $r$)-separated for $\epsilon = 6\epsilon_k + 2\tilde\epsilon$ and
$r = \frac{\theta}{2} (4k / v_d n \lambda)^{1/d}$, then $A \cap \mathbf{X}$ and
$A' \cap \mathbf{X}$ are disconnected in
$G_n(\lambda - 2\epsilon_k - \tilde\epsilon)$ and therefore in
$\tilde G_n(\lambda - 2\epsilon_k)$.

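Proof. Applying Lemma 1, it's immediate that, with probability at least
$1 - \delta$, all points of any $(A \cup A') \cap \mathbf{X}$ are in
$G_n(\lambda - \epsilon_k)$ and lower levels, and no point from
$S_{+r} \cap \mathbf{X}$ is in $G_n(\lambda - 5\epsilon_k - 2\tilde\epsilon)$ or
higher levels. Thus any path between $A$ and $A'$ in
$G_n(\lambda - 2\epsilon_k - \tilde\epsilon)$ must have an edge through the
center $x \in S$ of a ball $B(x, r) \subset S_{+r}$. This edge must therefore
have length greater than $2r$. We just need to show that no such edge
exists in $G_n(\lambda - 2\epsilon_k - \tilde\epsilon)$.

Let $V$ be the set of points (vertices) in
$G_n(\lambda - 2\epsilon_k - \tilde\epsilon)$. By Lemma 1,
$\min_{X_i \in V} f(X_i) \ge \lambda - 3\epsilon_k - \tilde\epsilon$. Given the
density assumption on $S$, $\lambda \ge 6\epsilon_k + 2\tilde\epsilon$, so
$\min_{X_i \in V} f(X_i) \ge \lambda/2$ and $V \subset \mathcal{L}_{\epsilon_k}$.
Now, given the range of $k$, Lemma 2 holds for the level set
$\mathcal{L}_{\epsilon_k}$. It follows that with probability at least
$1 - \delta$ (uniform over any such choice of $A, A'$ since the event is a
function of $\mathcal{L}_{\epsilon_k}$),

$$\max_{X_i \in V} r_{k,n}(X_i) \le 2^{3/d} \max_{X_i \in V} r_k(X_i) \le \frac{2r}{\theta}.$$

Thus, edge lengths in $G_n(\lambda - 2\epsilon_k - \tilde\epsilon)$ are at most
$2r$. □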

5.1.1. Identifying Modes

As a corollary to Lemma 3, we can guarantee in Lemma 5 that certain
salient modes are recovered by the empirical cluster tree. For this to
happen, we require in Definition 7 (ii) that an ($\epsilon$, $r$)-salient
mode $A$ is contained in a sufficiently large set $A_k$ so that we sample
points near the mode.

We start with the following VC lemma establishing conditions under which
subsets of $\mathbb{R}^d$ contain samples from $\mathbf{X}$.

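Lemma 4 (Lemma 5.1 of (Bousquet et al., 2004)). Suppose $\mathcal{C}$ is a
class of subsets of $\mathbb{R}^d$. Let $S_{\mathcal{C}}(2n)$ denote the
2n-shatter coefficient of $\mathcal{C}$. Let $\mathcal{F}_n$ denote the
empirical distribution over $n$ samples drawn i.i.d. from $\mathcal{F}$. For
$\delta > 0$, with probability at least $1 - \delta$,

$$\sup_{A \in \mathcal{C}} \frac{\mathcal{F}(A) - \mathcal{F}_n(A)}{\sqrt{\mathcal{F}(A)}} \le 2 \sqrt{\frac{\log S_{\mathcal{C}}(2n) + \log(4/\delta)}{n}}.$$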



Lemma 5 (Modes). Suppose $f$ satisfies (A.1) and (A.2). Let $G_n$ be the
k-NN or mutual k-NN graph. Let $\delta > 0$. There exist $C$ and
$C' = C'(\mathcal{F})$ such that, for

$$C d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

the following holds with probability at least $1 - 3\delta$. Let
$\epsilon = 6\epsilon_k + 2\tilde\epsilon$ and
$r = \frac{\theta}{2} (4k / v_d n \lambda)^{1/d}$. There exists a 1-1 map from
the set of ($\epsilon$, $r$)-salient modes to the leaves of the empirical
tree $\{\tilde G_n(\lambda)\}_{\lambda > 0}$.

Proof. First, with probability at least $1 - \delta$, for any
($\epsilon$, $r$)-salient mode $A$, there are samples in $\mathbf{X}$ from the
containing set $A_k$ (as defined in Definition 7). To arrive at this we
apply Lemma 4 for the class $\mathcal{C}$ of all possible balls
$B \subset \mathbb{R}^d$ (for this class
$S_{\mathcal{C}}(2n) \le (2n)^{d+1}$). We have with probability at least
$1 - \delta$ that for all $B$, $\mathcal{F}_n(B) > 0$ whenever

$$\mathcal{F}(B) \ge \frac{C d \ln(n/\delta)}{n} \ge 4\, \frac{(d+1) \log(2n) + \log(4/\delta)}{n},$$

where $C$ is appropriately chosen to satisfy the last inequality. Now,
from the definition of $A_k$, there exists $x$ such that
$B(x, r_k(x)) \subset A_k$, while we have
$\mathcal{F}(B(x, r_k(x))) = k/n \ge C d \ln(n/\delta)/n$, implying that
$\mathcal{F}_n(A_k) \ge \mathcal{F}_n(B(x, r_k(x))) \ge 1/n$.

As a consequence of the above argument, there is a finite number $m$ of
($\epsilon$, $r$)-salient modes since each contributes some points to the
final sample $\mathbf{X}$. We can therefore arrange them as
$\{A^i\}_{i=1}^m$ so that for $i < j$, we have $\lambda_i \le \lambda_j$ where
$\lambda_i = \inf_{x \in A_k^i} f(x)$. An injective map can now be constructed
iteratively as follows.

Starting with $i = 1$, we have by Lemma 3 that, with probability at least
$1 - 2\delta$, $A_k^i \cap \mathbf{X}$ is disconnected in
$\tilde G_n(\lambda_i - 2\epsilon_k)$ from all $A_k^j \cap \mathbf{X}$, $j > i$.
Let $U$ be the union of those CCs of $\tilde G_n(\lambda_i - 2\epsilon_k)$
containing points from $A_k^i \cap \mathbf{X}$. We've already established that
$U$ contains no point from any $A_k^j$, $j > i$. For $i > 1$, $U$ also
contains no point from any $A_k^j$, $j < i$. This is because, again by
Lemma 3, $A_k^j \cap \mathbf{X}$ is disconnected in
$\tilde G_n(\lambda_j - 2\epsilon_k)$ from $A_k^i \cap \mathbf{X}$, therefore
disconnected from $U$ since all CCs in $U$ remain connected at lower
levels. Now, since $U$ is disconnected from all $A_k^j$, $j \ne i$, we can
just map $A^i$ to any leaf rooted in $U$, $A^i$ being the unique mode
mapped to such a leaf. □

5.2. Maintaining Connectedness 

In this section we show that sample points from a connected subset $A$ of
$\mathbb{R}^d$ remain connected in the empirical cluster tree before pruning
(therefore also after pruning).

Similar to (Chaudhuri & Dasgupta, 2010), for any two points
$x, x' \in A \cap \mathbf{X}$ we uncover a path in $G_n$ near a path $P$ in $A$
that connects the two. The path in $G_n$ (the dashed path depicted below)
consists of a sequence $x_1 = x, x_2, \ldots, x_l = x'$ of sample points
from balls centered on the path $P$ in $A$ (the solid path depicted below).
The intuition is that $P$ is a high density route near which we can find
enough sample points to connect $x$ and $x'$.




The balls centered on $P$ must be chosen sufficiently small and
consecutively close so that consecutive terms $x_i$, $x_{i+1}$ are adjacent
in $G_n$. In (Chaudhuri & Dasgupta, 2010), points are adjacent (at any
particular level) whenever they are less than some scale $r$ apart; one can
therefore choose balls of the same radius $o(r)$ and consecutively $o(r)$
close. In our particular case, no single scale determines adjacency.
Adjacency is determined by the various nearest-neighbor radii and this
creates a multiscale effect that complicates the analysis. One way to
handle (and effectively get rid of) this multiscale effect is to choose
balls on $P$ of the same radius $r$ corresponding to the smallest possible
nearest-neighbor radius in $G_n$ (restricted to $A \cap \mathbf{X}$). However,
in order to get samples in such small balls one would need a rather large
sample size $n$, so the idea results in weak bounds. We instead use an
inductive argument which keeps track of the various scales, the intuition
being that nearest-neighbor radii have to change slowly along the path
$P$ from $x$ to $x'$.

Lemma 6 (Connectedness). Suppose $f$ satisfies (A.1) and (A.2). Let $G_n$
be the k-NN or mutual k-NN graph. Define
$\epsilon_k \doteq 11 F \sqrt{\ln(2n/\delta)/k}$ and let $\delta > 0$. There
exist $C$ and $C' = C'(\mathcal{F})$ such that, for

$$C \left(\max\{1, \sqrt{2}/\theta\}\right)^d d \ln(n/\delta) \le k \le C' \left(F \sqrt{\ln(n/\delta)}\right)^{2(\alpha+d)/(3\alpha+d)} n^{2\alpha/(3\alpha+d)},$$

the following holds with probability at least $1 - 3\delta$ simultaneously
for all connected subsets $A$ of $\mathbb{R}^d$.

Let $\lambda \doteq \inf_{x \in A} f(x) > 2\epsilon_k$. All points in
$A \cap \mathbf{X}$ belong to the same CC of $G_n(\lambda - 2\epsilon_k)$,
therefore of $\tilde G_n(\lambda - 2\epsilon_k)$.

Proof. First, let $C$ and $C'$ be large enough for Lemmas 1 and 2 to hold.
Define $r = \frac{1}{2} (\epsilon_k/2L)^{1/\alpha}$. By Lemma 2 (a), we have
that $f(x) \ge \lambda - \epsilon_k/2$ for any $x \in A_{+r}$. Applying Lemma 1,
it follows that with probability at least $1 - \delta$ (uniform over choices
of $A$), all points of $A_{+r} \cap \mathbf{X}$ are in
$G_n(\lambda - 2\epsilon_k)$. We will show that $A \cap \mathbf{X}$ is connected
in $G_n(\lambda - 2\epsilon_k)$, possibly through points in $A_{+r} \setminus A$.

In particular, any $x, x' \in A \cap \mathbf{X}$ are connected through a
sequence $\{x_i\}_{i \ge 1} \in A_{+r} \cap \mathbf{X}$ built according to the
following procedure. Let $P$ be a path in $A$ between $x$ and $x'$. Define
$\tau = \min\{1, \theta/\sqrt{2}\}$.

    Starting at $i = 1$ ($x_1 = x$), set $x_{i+1} = x'$ if
    $\|x_i - x'\| \le \theta \min\{r_{k,n}(x_i), r_{k,n}(x')\}$, and we're done;
    otherwise:

    Let $y_i$ be the point in $P \cap B(x_i, \tau 2^{-9/d} r_{k,n}(x_i))$
    farthest along the path $P$ from $x$, i.e. $P^{-1}(y_i)$ is highest in
    the set. Define the half-ball

    $$H(y_i) \doteq \{z : \|z - y_i\| < \tau 2^{-18/d} r_{k,n}(x_i),\ (z - y_i) \cdot (x_i - y_i) \ge 0\}.$$

    Pick $x_{i+1}$ in $H(y_i) \cap \mathbf{X}$, and continue.

The rest of the argument will proceed inductively as follows. First,
assume that $x_i \in A_{+r}$ and that $y_i$ exists. This is necessarily the
case for $x_1$, $y_1$. Assume $x_{i+1} \ne x'$. We will show that $x_{i+1}$
exists, is also in $A_{+r}$, and is adjacent to $x_i$ in $G_n$. It will
follow that $y_{i+1}$ must exist (if the process does not end) and is
distinct from $y_1, \ldots, y_i$. We'll then argue that the process must
also end.

To see that $x_{i+1}$ exists (under the aforementioned assumptions), we
apply Lemma 4 for the class $\mathcal{C}$ of all possible half-balls $H(y)$
centered at $y \in \mathbb{R}^d$ (for this class
$S_{\mathcal{C}}(2n) \le (2n)^{2d+1}$). We have with probability at least
$1 - \delta$ that for all $H(y)$, $\mathcal{F}_n(H(y)) > 0$ whenever

$$\mathcal{F}(H(y)) \ge \frac{C_0 d \ln(n/\delta)}{n} \ge \frac{(8d + 4) \log(2n) + 4 \log(4/\delta)}{n},$$

where $C_0$ is appropriately chosen to satisfy the last inequality. We next
show $\mathcal{F}(H(y_i))$ satisfies the first inequality.

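We first apply Lemma 2 on $\mathcal{L}_{\epsilon_k} \supset A_{+r}$ (this
inclusion was established earlier). We have with probability at least
$1 - \delta$ (uniform over all $A$) that for $x_i \in A_{+r}$,
$r_{k,n}(x_i) \le 2^{3/d} r_k(x_i) \le r$. Thus, for all $z \in H(y_i)$,

$$\|z - x_i\| \le 2 \cdot \tau 2^{-9/d} r_{k,n}(x_i) \le 2 \cdot \tau 2^{-9/d} r \le 2r, \quad (2)$$

implying by the same Lemma 2 that $f(z) \ge f(x_i)/2$. Now, from Lemma 1,
$f_n(x_i) \le f(x_i) + \epsilon_k \le 2 f(x_i)$. We can thus write

$$\begin{aligned}
\mathcal{F}(H(y_i)) &\ge \tfrac{1}{2}\, \mathrm{vol}\left(B(y_i, \tau 2^{-18/d} r_{k,n}(x_i))\right) \cdot \tfrac{1}{2} f(x_i) \\
&= \tau^d 2^{-20}\, \mathrm{vol}\left(B(x_i, r_{k,n}(x_i))\right) f(x_i) \\
&\ge \tau^d 2^{-21}\, \mathrm{vol}\left(B(x_i, r_{k,n}(x_i))\right) f_n(x_i) \\
&= \tau^d 2^{-21}\, \frac{k}{n} \ge \frac{C_0 d \ln(n/\delta)}{n}, \quad \text{for } C \ge 2^{21} C_0.
\end{aligned}$$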

Therefore there is a point Xi+i in H{yi) n X. In addition 
Xi+i E vl+r since it is within r of yi G A. 
Next we establish that there is an edge between $x_i$ and $x_{i+1}$ in $G_n$. To this end we relate $r_{k,n}(x_{i+1})$ to $r_{k,n}(x_i)$ by first relating $r_k(x_{i+1})$ to $r_k(x_i)$. Remember that for $z \in A_{+r}$ we have $r_k(z) < r$ so that for any $z' \in B(z, r_k(z))$ we have $f(z)/2 \le f(z') \le 2f(z)$. Also recall that we always have $\|x_{i+1} - x_i\| < 2r$ (see (2)), implying $f(x_{i+1}) \le 2f(x_i)$. We then have

$$v_d r_k^d(x_i) \cdot \tfrac{1}{2} f(x_i) \le \frac{k}{n} \le v_d r_k^d(x_{i+1}) \cdot 2 f(x_{i+1}) \le v_d r_k^d(x_{i+1}) \cdot 4 f(x_i),$$

where for the first two inequalities we used the fact that both balls $B(x_i, r_k(x_i))$ and $B(x_{i+1}, r_k(x_{i+1}))$ have the same mass $k/n$. It follows that

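$$r_{k,n}(x_{i+1}) \ge 2^{-3/d} r_k(x_{i+1}) \ge 2^{-6/d} r_k(x_i) \ge 2^{-9/d} r_{k,n}(x_i), \qquad (3)$$

implying $2^{-9/d} r_{k,n}(x_i) \le \min\{r_{k,n}(x_i), r_{k,n}(x_{i+1})\}$. We then get

$$\begin{aligned}
\|x_i - x_{i+1}\|^2 &= \|x_i - y_i\|^2 + \|x_{i+1} - y_i\|^2 - 2\,(x_i - y_i) \cdot (x_{i+1} - y_i) \\
&\le \|x_i - y_i\|^2 + \|x_{i+1} - y_i\|^2 \\
&\le 2\tau^2 \cdot \min\{r_{k,n}^2(x_i), r_{k,n}^2(x_{i+1})\} \\
&\le \min\{r_{k,n}^2(x_i), r_{k,n}^2(x_{i+1})\},
\end{aligned}$$

meaning $x_i$ and $x_{i+1}$ are adjacent in $G_n$.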
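As an illustration of the adjacency criterion that the last display verifies, here is a minimal Python sketch. It assumes that two sample points are linked when their distance is at most the smaller of their two $k$-NN radii, and that the $k$-th nearest neighbor is counted among the other sample points; the names `knn_radius` and `adjacent` are hypothetical.

```python
import math

def knn_radius(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor
    among the other sample points (assumed convention)."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def adjacent(points, i, j, k):
    """True if ||x_i - x_j|| <= min(r_kn(x_i), r_kn(x_j))."""
    r_i = knn_radius(points, i, k)
    r_j = knn_radius(points, j, k)
    return math.dist(points[i], points[j]) <= min(r_i, r_j)

# Three evenly spaced points and one far-away point on a line.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
near_pair = adjacent(sample, 0, 1, 1)
far_pair = adjacent(sample, 2, 3, 1)
```

Because the test uses the minimum of the two radii, a point in a sparse region cannot attach itself to a dense region merely because its own radius is large.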

Finally we argue that $y_{i+1}$ must exist. By (3) above we have

$$\|x_{i+1} - y_i\| < \tau 2^{-18/d} r_{k,n}(x_i) \le \tau 2^{-9/d} r_{k,n}(x_{i+1}),$$

in other words the ball $B(x_{i+1}, \tau 2^{-9/d} r_{k,n}(x_{i+1}))$ contains $y_i \in P$ in its interior. It follows by continuity of $P$ that there is a point $y_{i+1}$ in this ball further along the path from $x_i$ than $y_i$. Thus, recursively all $y_i$'s must be distinct, implying that all $x_i$'s must be distinct. Since all $x_i$'s belong to the finite sample $X$ the process must eventually terminate. □

5.2.1. Pruning of Spurious Branches 

As a corollary to Lemma 6 we can guarantee in Lemma 7 
that the pruning procedure will remove all spurious branch- 
ings, and hence, all spurious clusters. 

Lemma 7 (Pruning). Let $\delta > 0$. Under the assumptions of Lemma 6, the following holds with probability at least $1 - 3\delta$, provided $\tilde\epsilon \ge 3\epsilon_k$.

Consider two disjoint CCs $A_n$ and $A'_n$ at the same level in $\{G_n(\lambda)\}_{\lambda > 0}$. Let $V$ be the union of vertices of $A_n$ and $A'_n$, and define $\lambda = \inf_{x \in V} f(x)$. The vertices of $A_n$ and those of $A'_n$ are in separate CCs of $G(\lambda)$.



Proof. Let $\lambda_n = \min_{x \in V} f_n(x)$ be the level in the empirical tree containing $A_n, A'_n$. By Lemma 1, $\sup_{x \in X} |f_n(x) - f(x)| \le \epsilon_k$ so $\lambda_n \le \lambda + \epsilon_k$. Thus, we must have $\lambda \ge 2\epsilon_k$, since otherwise $\lambda_n < \tilde\epsilon$ implying $G_n(\lambda_n)$ must have a single connected component.

Now suppose points in $V$ were in the same component $A$ of $G(\lambda)$. By Lemma 6, all of $A \cap X$ is connected in $G_n(\lambda - 2\epsilon_k)$ and at lower levels. By the last argument $\lambda_n - \tilde\epsilon \le \lambda - 2\epsilon_k$ so the pruning procedure reconnects $A_n$ and $A'_n$. □
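The reconnection step used in this proof can be illustrated with a short Python sketch. It is not the paper's algorithm as stated: it assumes that the graph at a given level is the subgraph induced by the sample points whose estimated density is at least that level, and that pruning merges components at a level whenever they lie in a common component at the level lowered by the pruning parameter. All names (`components`, `pruned_components`, `adj`, `f_n`, `lam`, `eps_tilde`) are hypothetical.

```python
def components(adj, keep):
    """Label the connected components of the subgraph induced by `keep`.
    Returns a dict mapping each kept vertex to a component representative."""
    label = {}
    for start in keep:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in keep and v not in label:
                    label[v] = start
                    stack.append(v)
    return label

def pruned_components(adj, f_n, lam, eps_tilde):
    """Group the vertices with f_n >= lam by the component they belong to
    in the lower-level graph with threshold lam - eps_tilde."""
    upper = {v for v in adj if f_n[v] >= lam}
    lower = {v for v in adj if f_n[v] >= lam - eps_tilde}
    low_label = components(adj, lower)
    groups = {}
    for v in upper:
        groups.setdefault(low_label[v], set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# A path 0 - 1 - 2 whose middle vertex has a slightly lower density estimate.
adj = {0: [1], 1: [0, 2], 2: [1]}
f_n = {0: 1.0, 1: 0.8, 2: 1.0}
split = pruned_components(adj, f_n, 0.9, 0.05)   # dip is deeper than the pruning parameter
merged = pruned_components(adj, f_n, 0.9, 0.2)   # dip is shallower than the pruning parameter
```

In the example the two end vertices are separate at level 0.9; they are merged only when the pruning parameter is large enough to reach the level at which the middle vertex appears.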
Appendix

A. Proof of Lemma 1 

Lemma 1 follows as a corollary to Lemma 9 below. 

We'll often make use of the following form of the Chernoff 
bound. 

Lemma 8 ((Angluin & Valiant, 1979)). Let $N \sim \mathrm{Bin}(n, p)$. Then for all $0 < t \le 1$,

$$P(N \ge (1 + t)np) \le \exp(-t^2 np/3),$$
$$P(N \le (1 - t)np) \le \exp(-t^2 np/3).$$
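As a quick numerical sanity check of this form of the Chernoff bound, the following Python sketch compares exact binomial tails with the bound. It assumes the exponent $t^2 np/3$ for both tails, which is the form used in the proofs below; the helper names are hypothetical.

```python
import math

def binom_pmf(n, p, j):
    """P(N = j) for N ~ Bin(n, p)."""
    return math.comb(n, j) * p**j * (1 - p)**(n - j)

def upper_tail(n, p, t):
    """Exact P(N >= (1 + t) n p)."""
    return sum(binom_pmf(n, p, j) for j in range(n + 1)
               if j >= (1 + t) * n * p)

def lower_tail(n, p, t):
    """Exact P(N <= (1 - t) n p)."""
    return sum(binom_pmf(n, p, j) for j in range(n + 1)
               if j <= (1 - t) * n * p)

def chernoff(n, p, t):
    """The bound exp(-t^2 n p / 3), assumed for both tails."""
    return math.exp(-t * t * n * p / 3)

n, p, t = 200, 0.1, 0.5
bound = chernoff(n, p, t)
holds = upper_tail(n, p, t) <= bound and lower_tail(n, p, t) <= bound
```

The bound is loose for these parameters, which is typical: the proofs only need the exponential dependence on $np$, not sharp constants.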

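Lemma 9. Suppose the density function $f$ satisfies:

(a) $f$ is uniformly continuous on $\mathbb{R}^d$. In other words, $\forall \epsilon > 0$, $\exists c_\epsilon$ s.t. for all balls $B$ where $\mathrm{vol}(B) \le c_\epsilon$ we have $\sup_{x, x' \in B} |f(x) - f(x')| \le \epsilon/2$.

(b) $\exists F$, $\sup_{x \in \mathbb{R}^d} f(x) = F$.

Fix $0 < \epsilon < F$, let $n \ge 2$, and $k < n$. If $k/n\epsilon < c_\epsilon/4$ then

$$P\Big(\sup_{X_i \in X} |f(X_i) - f_n(X_i)| > \epsilon\Big) \le 2n \exp\Big(-\frac{\epsilon^2 k}{120 F^2}\Big).$$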



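Proof. We'll be using the short-hand notation $B_{k,n}(x) = B(x, r_{k,n}(x))$ for readability in what follows.

We start with the simple bound:

$$\begin{aligned}
P\Big(\sup_{X_i \in X} |f(X_i) - f_n(X_i)| > \epsilon\Big)
&\le P\big(\exists X_i \in X,\ f_n(X_i) > f(X_i) + \epsilon\big) + P\big(\exists X_i \in X,\ f_n(X_i) < f(X_i) - \epsilon\big) \\
&= P\Big(\exists X_i \in X,\ \mathrm{vol}(B_{k,n}(X_i)) < \frac{k}{n(f(X_i) + \epsilon)}\Big) \qquad (4) \\
&\quad + P\Big(\exists X_i \in X,\ f(X_i) > \epsilon,\ \mathrm{vol}(B_{k,n}(X_i)) > \frac{k}{n(f(X_i) - \epsilon)}\Big). \qquad (5)
\end{aligned}$$

We handle (4) and (5) by first fixing $i$ and conditioning on $X_i = x$. We start with (4):

$$P\Big(\exists X_i \in X,\ \mathrm{vol}(B_{k,n}(X_i)) < \frac{k}{n(f(X_i) + \epsilon)}\Big) \le n \int P\Big(\mathrm{vol}(B_{k,n}(x)) < \frac{k}{n(f(x) + \epsilon)}\Big)\, d\mathcal{F}(x), \qquad (6)$$

where the inner probability is over the choice of $X \setminus \{X_i = x\}$ for $i$ fixed. In what follows we use the notation $\mathcal{F}_{n-1}$ to denote the empirical distribution over $X \setminus \{X_i = x\}$.

Assume $\mathrm{vol}(B_{k,n}(x)) < k/n(f(x) + \epsilon) \le k/n\epsilon < c_\epsilon$. Then by the uniform continuity assumption on $f$ we have

$$\mathcal{F}(B_{k,n}(x)) \le (f(x) + \epsilon/2)\,\mathrm{vol}(B_{k,n}(x)) < \frac{f(x) + \epsilon/2}{f(x) + \epsilon} \cdot \frac{k}{n} \le \Big(1 - \frac{\epsilon}{4F}\Big)\frac{k}{n}.$$

Now let $B(x)$ be the ball centered at $x$ with $\mathcal{F}$-mass $(1 - \epsilon/4F)(k/n)$. Since by the above, $\mathcal{F}(B_{k,n}(x)) < \mathcal{F}(B(x))$, we also have that $\mathcal{F}_n(B_{k,n}(x)) \le \mathcal{F}_n(B(x))$. This implies that

$$\mathcal{F}(B(x)) = \Big(1 - \frac{\epsilon}{4F}\Big)\frac{k}{n} = \Big(1 - \frac{\epsilon}{4F}\Big)\mathcal{F}_n(B_{k,n}(x)) \le \Big(1 - \frac{\epsilon}{4F}\Big)\mathcal{F}_n(B(x)).$$

In other words, letting $t = \epsilon/(4F - \epsilon)$ and applying the Chernoff bound of Lemma 8, we have

$$\begin{aligned}
P\Big(\mathrm{vol}(B_{k,n}(x)) < \frac{k}{n(f(x) + \epsilon)}\Big)
&\le P\big(\mathcal{F}_n(B(x)) \ge (1 + t)\mathcal{F}(B(x))\big) \\
&\le \exp\big(-t^2 (n - 1)\mathcal{F}(B(x))/3\big) \le \exp\big(-\epsilon^2 k/96F^2\big).
\end{aligned}$$

Combine with (6) to complete the bound on (4).

We now turn to bounding (5). We proceed as before by fixing $i$ and integrating over $X_i = x$ where $f(x) > \epsilon$, that is

$$P\Big(\exists X_i \in X,\ f(X_i) > \epsilon,\ \mathrm{vol}(B_{k,n}(X_i)) > \frac{k}{n(f(X_i) - \epsilon)}\Big) \le n \int_{f(x) > \epsilon} P\Big(\mathrm{vol}(B_{k,n}(x)) > \frac{k}{n(f(x) - \epsilon)}\Big)\, d\mathcal{F}(x), \qquad (7)$$

where again the probability is over the choice of $X \setminus \{X_i = x\}$. Now, we can no longer infer how much $f$ deviates within $B_{k,n}(x)$ from just the event in question (as we did for the other direction). The trick (inspired by (Devroye & Wagner, 1977)) is to consider a related ball.

Let $B(x)$ be the ball centered at $x$ of volume $k/n(f(x) - 3\epsilon/4)$. Then

$$\mathrm{vol}(B_{k,n}(x)) > \frac{k}{n(f(x) - \epsilon)} > \mathrm{vol}(B(x)) \implies \mathcal{F}_{n-1}(B(x)) \le \frac{k-1}{n-1} \le \frac{k}{n}.$$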
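Since $\mathrm{vol}(B(x)) < 4k/n\epsilon < c_\epsilon$, we have by the uniform continuity of $f$ that

$$\mathcal{F}(B(x)) \ge (f(x) - \epsilon/2)\,\mathrm{vol}(B(x)) \ge \Big(1 + \frac{\epsilon}{4F}\Big)\frac{k}{n} \ge \Big(1 + \frac{\epsilon}{4F}\Big)\mathcal{F}_{n-1}(B(x)).$$

We thus have for $t = \epsilon/(4F + \epsilon)$, and using Lemma 8 that

$$P\Big(\mathrm{vol}(B_{k,n}(x)) > \frac{k}{n(f(x) - \epsilon)}\Big) \le P\big(\mathcal{F}_n(B(x)) \le (1 - t)\mathcal{F}(B(x))\big) \le \exp\big(-\epsilon^2 k/120F^2\big).$$

Combine with (7) to complete the bound on (5).

The final result is proved by then combining the bounds on (4) and (5). □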
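The constants 96 and 120 can be checked by hand. The derivation below is a sketch under stated assumptions: $n \ge 2$ (so that $(n-1)/n \ge 1/2$), $0 < \epsilon < F$, a Chernoff exponent of the form $t^2 np/3$, and the two auxiliary balls $B(x)$ defined in the respective halves of the proof.

```latex
% Bound on (4): t = \epsilon/(4F-\epsilon), \mathcal{F}(B(x)) = (1-\epsilon/4F)\,k/n.
\frac{t^2 (n-1)\,\mathcal{F}(B(x))}{3}
  = \frac{\epsilon^2}{16F^2\,(1-\epsilon/4F)^2}\cdot\Big(1-\frac{\epsilon}{4F}\Big)\cdot\frac{(n-1)\,k}{3n}
  \;\ge\; \frac{\epsilon^2}{16F^2}\cdot\frac{k}{6}
  = \frac{\epsilon^2 k}{96F^2}.

% Bound on (5): t = \epsilon/(4F+\epsilon), \mathcal{F}(B(x)) \ge (1+\epsilon/4F)\,k/n,
% and 1+\epsilon/4F \le 5/4 because \epsilon < F.
\frac{t^2 (n-1)\,\mathcal{F}(B(x))}{3}
  \;\ge\; \frac{\epsilon^2}{16F^2\,(1+\epsilon/4F)}\cdot\frac{(n-1)\,k}{3n}
  \;\ge\; \frac{\epsilon^2}{20F^2}\cdot\frac{k}{6}
  = \frac{\epsilon^2 k}{120F^2}.
```

Summing the two tail bounds over the $n$ sample points and keeping the weaker exponent gives a bound of the form $2n\exp(-\epsilon^2 k/120F^2)$.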

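Proof of Lemma 1. For any $0 < \epsilon < 1$, let $c_\epsilon = v_d 2^{-d} (\epsilon/2L)^{d/\alpha}$ so that whenever for balls $B$, $\mathrm{vol}(B) \le c_\epsilon$, the radius $r$ of $B$ is at most $\frac{1}{2}(\epsilon/2L)^{1/\alpha}$. Thus, $\sup_{x, x' \in B} |f(x) - f(x')| \le L(2r)^\alpha \le \epsilon/2$. Now, for the settings of $\epsilon$ and $k$ in the lemma statement, we have

$$0 < \epsilon < F \quad \text{and} \quad \frac{4k}{n\epsilon} < v_d 2^{-d} \Big(\frac{\epsilon}{2L}\Big)^{d/\alpha},$$

so we can apply Lemma 9 to get

$$P\Big(\sup_{X_i \in X} |f(X_i) - f_n(X_i)| > \epsilon\Big) \le 2n \exp\Big(-\frac{\epsilon^2 k}{120F^2}\Big) \le \delta.$$

□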



B. Proof of Lemma 2 

Lemma 2 follows as a corollary to Lemma 10 below. 

Lemma 10. Consider a subset $A$ of $\mathbb{R}^d$ such that there exists $r$, satisfying

$$\forall x \in A,\quad \|x - x'\| < 2r \implies \tfrac{1}{2} f(x) \le f(x') \le 2f(x).$$

Assume $X_i \in X \cap A$. We have

$$P\big(r_{k,n}(X_i) > 2^{3/d} r_k(X_i) \mid r_k(X_i) < 2^{-3/d} r\big) \le \exp(-k/12),$$

$$P\big(r_{k,n}(X_i) < 2^{-3/d} r_k(X_i) \mid r_k(X_i) < 2^{-3/d} r\big) \le \exp(-k/192).$$



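Proof. Let $X_i \in X$, and fix $X_i = x \in A$ such that $r_k(x) < 2^{-3/d} r$. We automatically have

$$\tfrac{1}{2}\,\mathrm{vol}(B(x, r_k(x)))\, f(x) \le \mathcal{F}(B(x, r_k(x))) \le 2\,\mathrm{vol}(B(x, r_k(x)))\, f(x).$$

We similarly have

$$\begin{aligned}
\mathcal{F}\big(B(x, 2^{3/d} r_k(x))\big) &\ge \tfrac{1}{2}\,\mathrm{vol}\big(B(x, 2^{3/d} r_k(x))\big)\, f(x) \\
&= \tfrac{1}{2} \cdot 8\,\mathrm{vol}(B(x, r_k(x)))\, f(x) \\
&\ge 2\mathcal{F}(B(x, r_k(x))) = 2\,\frac{k}{n}.
\end{aligned}$$

Again, similarly

$$\mathcal{F}\big(B(x, 2^{-3/d} r_k(x))\big) \le \tfrac{1}{2}\,\mathcal{F}(B(x, r_k(x))) = \frac{k}{2n}.$$

Thus by Lemma 8,

$$\begin{aligned}
P\big(r_{k,n}(x) > 2^{3/d} r_k(x)\big) &\le P\Big(\mathcal{F}_n\big(B(x, 2^{3/d} r_k(x))\big) < \frac{k}{n} \le \tfrac{1}{2}\mathcal{F}\big(B(x, 2^{3/d} r_k(x))\big)\Big) \\
&\le \exp\big(-(n - 1)\,\mathcal{F}\big(B(x, 2^{3/d} r_k(x))\big)/12\big) \le \exp(-k/12),
\end{aligned}$$

and

$$\begin{aligned}
P\big(r_{k,n}(x) < 2^{-3/d} r_k(x)\big) &\le P\Big(\mathcal{F}_n\big(B(x, 2^{-3/d} r_k(x))\big) \ge \frac{k}{n} \ge 2\mathcal{F}\big(B(x, 2^{-3/d} r_k(x))\big)\Big) \\
&\le \exp\big(-(n - 1)\,\mathcal{F}\big(B(x, 2^{-3/d} r_k(x))\big)/3\big) \le \exp(-k/192).
\end{aligned}$$

Conclude by integrating these probabilities over possible values of $X_i = x \in A$. □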
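The constants 12 and 192 follow from simple arithmetic. The derivation below is a sketch assuming $n \ge 2$ (so that $(n-1)/n \ge 1/2$), a Chernoff exponent of the form $t^2 np/3$, and the density comparison $\tfrac12 f(x) \le f(x') \le 2f(x)$ on the balls involved.

```latex
% First bound: deviation t = 1/2, and \mathcal{F}(B(x, 2^{3/d} r_k(x))) \ge 2k/n.
\frac{t^2 (n-1)\,\mathcal{F}\big(B(x, 2^{3/d} r_k(x))\big)}{3}
  = \frac{(n-1)\,\mathcal{F}\big(B(x, 2^{3/d} r_k(x))\big)}{12}
  \;\ge\; \frac{n-1}{n}\cdot\frac{2k}{12}
  \;\ge\; \frac{k}{12}.

% Second bound: deviation t = 1, with the mass of the small ball bounded below by
\mathcal{F}\big(B(x, 2^{-3/d} r_k(x))\big)
  \;\ge\; \tfrac{1}{2} f(x)\cdot\tfrac{1}{8}\,\mathrm{vol}(B(x, r_k(x)))
  \;\ge\; \tfrac{1}{32}\,\mathcal{F}(B(x, r_k(x)))
  = \frac{k}{32n},
% so that
\frac{(n-1)\,\mathcal{F}\big(B(x, 2^{-3/d} r_k(x))\big)}{3}
  \;\ge\; \frac{n-1}{n}\cdot\frac{k}{96}
  \;\ge\; \frac{k}{192}.
```

The factor $2^{\pm 3/d}$ in the radii is what turns into the factor $8^{\pm 1}$ in volume, and hence into constants that do not depend on the dimension $d$.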

Proof of Lemma 2. Part (a) follows directly from the Hölder assumption on $f$. For part (b), notice that

$$\sup_{x \in \mathcal{L}_\lambda} v_d r_k^d(x)\,\lambda \le \inf_{x \in \mathcal{L}_\lambda} \mathcal{F}(B(x, r_k(x))) = \frac{k}{n}$$

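so that $\sup_{x \in \mathcal{L}_\lambda} r_k(x) < 2^{-3/d} r$ for the setting of $k$. Now using part (a) again we have for all $x \in \mathcal{L}_\lambda$,

$$v_d r_k^d(x) \cdot \frac{f(x)}{2} \le \mathcal{F}(B(x, r_k(x))) = \frac{k}{n},$$

so $r_k(x) \le (2k/v_d n f(x))^{1/d}$.

Finally, the probabilistic statement is obtained by applying Lemma 10 and a union-bound over $X \cap \mathcal{L}_\lambda$. □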



